[Refactor] Uniform PoDAttention API with Horizontal Fusion SMs Schedule #967
Conversation
Some of the unittests failed, for example (test_block_sparse_attention[False-256-16-16-128-64-16-4])
Hi, can I ask when this is planned to be merged? I made a PR to support POD Attn in SGLang using the old API and plan to get that working with CUDA graph first.
I really like the uniform batch API that this PR presents. I ran this on an A100 and compared it with the existing FlashInfer POD-Attention implementation. On average this performed around 10-15% worse, but still better than serial execution. Performance was worse for larger prefill context lengths, while for smaller context lengths it was more comparable.
Yeah, this is more convenient. One issue I had during my PR is that I have to fill a 2D attention mask for prefill every time, instead of using the page table & indices.
Will the old API be preserved? Thanks.
@AKKamath Btw, I wonder what the reason was for using a mask instead of a page table for prefill qkv?
@yzh119 can correct me here, but I believe the mask prefill kernel (single_prefill) had better performance than the page-table prefill, because the page-table prefill had higher register usage, causing register spills.
But don't we waste a lot of space storing the 2D mask? For example, the default shape is the 2D cumulative sequence lengths (qo_lens, kv_lens), but when converting from the page table qo_indptr, kv_indptr to the mask, it is very sparse, with each qo related to only a few kv entries of its own request within the whole cumulative sequence. It can also be expensive to fill the mask.
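To make the space and fill cost concrete, here is a minimal sketch (hypothetical helper, not FlashInfer code) of expanding per-request qo_indptr/kv_indptr into one packed 2D mask: the mask covers total_qo_len x total_kv_len entries even though only the per-request blocks are ever set.

```python
import torch

def build_packed_mask(qo_indptr, kv_indptr):
    """Hypothetical helper (not FlashInfer code): expand per-request indptr
    arrays into one packed 2D mask of shape (total_qo_len, total_kv_len).
    Each query can only attend to kv entries of its own request, so the mask
    is block-diagonal and mostly False; its size and fill cost grow with the
    product of the packed sequence lengths."""
    total_qo, total_kv = int(qo_indptr[-1]), int(kv_indptr[-1])
    mask = torch.zeros(total_qo, total_kv, dtype=torch.bool)
    for i in range(len(qo_indptr) - 1):
        q0, q1 = int(qo_indptr[i]), int(qo_indptr[i + 1])
        k0, k1 = int(kv_indptr[i]), int(kv_indptr[i + 1])
        mask[q0:q1, k0:k1] = True  # causal structure inside each block omitted
    return mask

# 3 requests -> a 48 x 96 mask where only the 3 diagonal blocks are True.
qo_indptr = torch.tensor([0, 16, 32, 48])
kv_indptr = torch.tensor([0, 32, 64, 96])
print(build_packed_mask(qo_indptr, kv_indptr).float().mean())  # ~0.33
```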
Actually, I realized POD Attention is not designed to mix many prefill requests with decode requests; it mixes in only one prefill at a time, so that we can use causal masking without any custom mask.
Follow-up in #1026.
std::accumulate(qo_len_ptr_h_p.begin(), qo_len_ptr_h_p.end(), 0) +
    2 * page_size * std::accumulate(kv_len_ptr_h_p.begin(), kv_len_ptr_h_p.end(), 0);
Hi, I'm interested in implementing a persistent POD Attn and have some questions. Why don't we do qo_len_ptr_h_p[i] * kv_len_ptr_h_p[i] * 2 here, to model the quadratic compute load? Thanks.
I am using the current calculation mainly to model memory load instead of compute load. For different workloads, different heuristics can work best. It would be helpful if you benchmark and decide.
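For reference, a small sketch of the two heuristics under discussion (a Python analogue of the C++ snippet above; the quadratic variant is the alternative raised in the question, not what this PR implements):

```python
def memory_load(qo_lens, kv_lens, page_size):
    # Current heuristic (mirrors the snippet above): linear in qo/kv lengths,
    # a proxy for the bytes each request streams through memory.
    return sum(qo_lens) + 2 * page_size * sum(kv_lens)

def compute_load(qo_lens, kv_lens):
    # Alternative raised in the question: qo_len * kv_len * 2 per request,
    # a proxy for the quadratic attention compute.
    return sum(q * k * 2 for q, k in zip(qo_lens, kv_lens))

# Example: one long prefill plus two decodes. The long prefill dominates the
# compute model much more strongly than the memory model, so the two
# heuristics can lead to different work partitions; benchmarking decides.
print(memory_load([2048, 1, 1], [128, 256, 256], page_size=16))
print(compute_load([2048, 1, 1], [128, 256, 256]))
```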
BTW, #1026 will be the upstream version and this PR has been deprecated. It would be helpful if you could refer directly to the new PR.
Thanks. Do you have any plans to adapt POD Attn to the persistent template? I also plan to work on that.
Description
This PR is a follow-up to #858, which integrates the PoDAttention (arXiv link) API in a user-transparent manner. Users can now invoke PoDAttention via the same API as BatchPrefillWithPagedKVCache, without explicitly specifying whether requests are prefill or decode (example code).
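For illustration, a minimal sketch of what the uniform invocation could look like from Python, assuming the existing BatchPrefillWithPagedKVCacheWrapper plan/run interface; the shapes, head counts, and request mix below are made up, and the PR's linked example code is authoritative.

```python
import torch
import flashinfer

# Sketch: one prefill request (qo_len 512) and three decode requests (qo_len 1)
# issued through the same batch-prefill wrapper; under this PR the kernel
# decides internally which requests take the prefill vs decode path.
# Argument names/order follow my reading of the flashinfer Python API.
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchPrefillWithPagedKVCacheWrapper(workspace, "NHD")

num_qo_heads, num_kv_heads, head_dim, page_size = 32, 8, 128, 16
qo_indptr = torch.tensor([0, 512, 513, 514, 515], dtype=torch.int32, device="cuda")
kv_indptr = torch.tensor([0, 32, 160, 288, 416], dtype=torch.int32, device="cuda")
kv_indices = torch.arange(416, dtype=torch.int32, device="cuda")
kv_last_page_len = torch.full((4,), page_size, dtype=torch.int32, device="cuda")

wrapper.plan(qo_indptr, kv_indptr, kv_indices, kv_last_page_len,
             num_qo_heads, num_kv_heads, head_dim, page_size, causal=True)

q = torch.randn(515, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
kv_cache = torch.randn(416, 2, page_size, num_kv_heads, head_dim,
                       dtype=torch.float16, device="cuda")
o = wrapper.run(q, kv_cache)  # no explicit prefill/decode split needed
```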
Key Changes
Support for Non-Contiguous Q/O and KV Tensor Layout
Previously, tensor offsets were computed using indptr, assuming contiguous layouts. PoDAttention requires supporting mixed prefill/decode subsets within requests, necessitating a non-contiguous layout. This PR introduces q_lenptr and kv_lenptr to accommodate this functionality (code link).
Horizontal Fusion-Style Implementation
For improved efficiency, the prefill and decode subsets of requests are aware of each other, enabling optimal selection of kernel hyperparameters and persistent kernel execution.
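Very roughly, the scheduling idea looks like the following (purely illustrative pseudologic, not the kernel's real scheduler): instead of launching separate prefill and decode kernels, tiles from both subsets go into one work list that a fixed pool of persistent CTAs drains, so both kinds of work share the SMs within a single launch.

```python
# Purely illustrative: a fixed pool of persistent "CTAs" drains one queue that
# interleaves prefill and decode tiles, instead of two back-to-back launches.
from collections import deque

prefill_tiles = [("prefill", req, tile) for req in range(1) for tile in range(8)]
decode_tiles = [("decode", req, 0) for req in range(64)]
work_queue = deque(prefill_tiles + decode_tiles)

NUM_PERSISTENT_CTAS = 4  # stand-in for the number of SMs kept resident
assignments = {cta: [] for cta in range(NUM_PERSISTENT_CTAS)}
cta = 0
while work_queue:
    assignments[cta].append(work_queue.popleft())  # next CTA grabs the next tile
    cta = (cta + 1) % NUM_PERSISTENT_CTAS

for cta, tiles in assignments.items():
    kinds = sorted({kind for kind, _, _ in tiles})
    print(f"CTA {cta}: {len(tiles)} tiles, kinds={kinds}")
```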
Limitations and Future Work
The current prefill/decode classification heuristic (qo_len > threshold) is preliminary and requires improvement (classifier implementation).
cc @AKKamath @yzh119